The GUC Goes to TREC 2004: Using Whole or Partial Documents for Retrieval and Classification in the Genomics Track

Authors

  • Kareem Darwish
  • Amgad Madkour
Abstract

We were interested in examining the relative effect of using parts of the documents, different combinations of parts of the documents, or whole documents on retrieval and classification. We were also interested in the effect of MeSH terms on retrieval. Our experiments show that indexing titles, abstracts, and MeSH terms for ad hoc retrieval yielded statistically significantly better results than any other part or combination of parts, with abstracts outperforming any other individual part of the documents. In the triage sub-task, using whole documents for training a classifier outperformed using titles, abstracts, diagram captions, MeSH terms, and windows of text around gene names. However, training a classifier using the combination of titles, abstracts, and MeSH terms produced results comparable to using whole documents.
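
As a rough illustration of what indexing "parts or combinations of parts" can mean in practice, the sketch below assembles the text to be indexed from a chosen subset of document fields. It is not the authors' system: the field names (`title`, `abstract`, `mesh`, `body`) and the helper `build_index_text` are hypothetical and chosen only for this example.

```python
from typing import Dict, List, Sequence, Union

FieldValue = Union[str, Sequence[str]]


def build_index_text(doc: Dict[str, FieldValue], fields: List[str]) -> str:
    """Concatenate the selected fields of a document into one indexable string.

    Field names are hypothetical; missing or empty fields are skipped.
    """
    parts = []
    for name in fields:
        value = doc.get(name, "")
        # Multi-valued fields (e.g. a list of MeSH terms) are joined with spaces.
        if isinstance(value, (list, tuple)):
            value = " ".join(value)
        value = value.strip()
        if value:
            parts.append(value)
    return "\n".join(parts)


# Toy document, invented for illustration only.
doc = {
    "title": "BRCA1 in mice",
    "abstract": "We study a gene.",
    "mesh": ["Mice", "Genes, BRCA1"],
    "body": "Full text of the article.",
}

# One experimental condition: titles + abstracts + MeSH terms.
combined = build_index_text(doc, ["title", "abstract", "mesh"])
# Another condition: the abstract alone.
abstract_only = build_index_text(doc, ["abstract"])
```

Each field selection produces a separate representation of the same collection, which could then be indexed or used for classifier training and compared against the whole-document condition.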

Similar resources

TREC 2004 Genomics Track Overview

The TREC 2004 Genomics Track consisted of two tasks. The first task was a standard ad hoc retrieval task using topics obtained from real biomedical research scientists and documents from a large subset of the MEDLINE bibliographic database. The second task focused on categorization of full-text documents, simulating the task of curators of the Mouse Genome Informatics (MGI) system and consistin...

DIMACS at the TREC 2004 Genomics Track

DIMACS participated in the text categorization and ad hoc retrieval tasks of the TREC 2004 Genomics track. For the categorization task, we tackled the triage and annotation hierarchy subtasks. 1. TEXT CATEGORIZATION TASK The Mouse Genome Informatics (MGI) project of the Jackson Laboratory provides data on the genetics, genomics, and biology of the laboratory mouse. In particular, the Mouse Geno...

TREC Genomics 2004

The TREC Genomics track started in 2003 as the first domain-specific track of the Text Retrieval Conference. The aim of the track is to develop various IR tasks specific to the biomedical field. One task of the first year involved the retrieval of documents given a specific gene, while the second task required the extraction of a brief description of gene function from documents. This year sees a...

Enhancing access to the Bibliome: the TREC 2004 Genomics Track

BACKGROUND: The goal of the TREC Genomics Track is to improve information retrieval in the area of genomics by creating test collections that will allow researchers to improve and better understand failures of their systems. The 2004 track included an ad hoc retrieval task, simulating use of a search engine to obtain documents about biomedical topics. This paper describes the Genomics Track of t...

The TREC 2004 genomics track categorization task: classifying full text biomedical documents

BACKGROUND: The TREC 2004 Genomics Track focused on applying information retrieval and text mining techniques to improve the use of genomic information in biomedicine. The Genomics Track consisted of two main tasks, ad hoc retrieval and document categorization. In this paper, we describe the categorization task, which focused on the classification of full-text documents, simulating the task of c...

Publication date: 2004